When predicting a binary dependent variable, the output of your model is usually a probability or is easily converted to a probability. Many times it is desirable to convert this probability to a binary variable to match the dependent variable.
For example, if you are predicting whether a customer will buy a product, you may want to convert the probability to buy into a binary buy/not-buy prediction for each consumer in your scoring data set. The default cut-off for most algorithms is 0.50. That is, a consumer with a probability greater than 50% is predicted to be a buyer, and a consumer with a probability less than 50% is predicted to be a non-buyer.
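As a minimal sketch of that default rule (the probabilities below are made up):

```python
import numpy as np

# Hypothetical predicted probabilities for five consumers
probs = np.array([0.12, 0.48, 0.50, 0.73, 0.91])

# Default rule: probability greater than 0.50 -> predicted buyer (1), otherwise non-buyer (0)
preds = np.where(probs > 0.50, 1, 0)
print(preds)  # -> [0 0 0 1 1]
```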
But is .50 the right cut-off? Just because it is the default does not mean it is necessarily the best. In this notebook, we examine a method to determine the best cut-off based on the context of your business problem.
Note that if you are using these methods on a real-world problem, make sure that you determine the best cut-off on your TRAINING DATA SET and confirm the results on your TESTING and VALIDATION data sets.
Before we get too far, I just want to point out that there are many different ways to address this problem. I use this method because it works for me, not because it is superior to any other method.
Introducing a Gains Table
3.1 Assign Deciles based on the probability
3.2 Assemble the Gains Table
#!pip install --upgrade numpy
#!pip install plotly --upgrade
!pip install chart-studio --upgrade
import chart_studio.plotly as py
import plotly.graph_objs as go
import plotly
import pandas as pd
import numpy as np
from sklearn import metrics
#Un-comment these options if you want to expand the number of rows and columns you see in the notebook.
#pd.set_option('display.max_columns', None)
#pd.set_option('display.max_rows', None)
!rm df_for_export-2.csv
!wget https://raw.githubusercontent.com/shadgriffin/best_cut_off/master/df_for_export-2.csv
pd_data = pd.read_csv("df_for_export-2.csv", sep=",", header=0)
This is a simple sample data set which represents equipment failure for an oil and gas company.
pd_data.head()
For example, in the first record above, for ID 1000003 on 04/05/2016 the probability to fail was .177485 and it did not fail.
The objective is to find the probability cut-off (P_FAIL) that best represents the actual target (dependent variable).
A Gains table is a useful tool for evaluating the accuracy of a model where the predicted variable is binary. A Gains table is easy to explain and extremely effective in determining the fitness of a machine learning model.
dfx=pd_data
#Sort the data by P_FAIL in descending order
dfx=dfx.sort_values(by=["P_FAIL"], ascending=[False])
#add a very small random number to the probability to break ties
dfx['wookie'] = (np.random.randint(0, 100, dfx.shape[0]))/100000000000000000
dfx['P_FAIL']=dfx['P_FAIL']+dfx['wookie']
#Create deciles based on P_FAIL
dfx['DECILE'] = pd.qcut(dfx['P_FAIL'], 10, labels=np.arange(100, 0, -10))
# Find the minimum probability for each decile
tips_summedv = pd.DataFrame(dfx.groupby(['DECILE'])['P_FAIL'].min())
# Find the maximum probability of each decile
tips_summedw = pd.DataFrame(dfx.groupby(['DECILE'])['P_FAIL'].max())
# Find the Actual Failure rate for each decile.
tips_summedx = pd.DataFrame(dfx.groupby(['DECILE'])['FAILURE_TARGET'].mean())
#Sum the number of Failures in each decile.
tips_summedy = pd.DataFrame(dfx.groupby(['DECILE'])['FAILURE_TARGET'].sum())
# count the records in each decile
tips_summedz = pd.DataFrame(dfx.groupby(['DECILE'])['FAILURE_TARGET'].count())
#Aggregate the summaries into one dataframe
tips = pd.concat([tips_summedv,tips_summedw, tips_summedx, tips_summedy,tips_summedz], axis=1)
tips.columns = ['MIN_SCORE','MAX_SCORE','FAILURE_RATE','FAILURES', 'OBS']
tips=tips.sort_values(by=['DECILE'], ascending=[False])
gains=tips
#Find the number of cumulative failures by decile.
gains['CUML_FAILURES']=gains['FAILURES'].cumsum()
#Find the percentage of failures in each decile
gains['PCT_OF_FAILURES']=(gains.FAILURES)/(dfx['FAILURE_TARGET'].sum())*100
#Find the cumulative percentage of failures in each decile.
gains['CUML_PCT_OF_FAILURES']=gains.PCT_OF_FAILURES.cumsum()
#Format the final output
gains=gains[['OBS','MIN_SCORE','MAX_SCORE','FAILURES','FAILURE_RATE','PCT_OF_FAILURES','CUML_FAILURES','CUML_PCT_OF_FAILURES']]
gains
The gains table above shows the relationship between the predicted value (P_FAIL) and the dependent variable (FAILURE_TARGET).
Interpreting the gains table is straightforward. If the deciles were assigned randomly, you would expect 10% of the failures to fall in each decile. When you use P_FAIL (the predicted probability) to create the deciles, 69% of the failures fall in the top decile. Put another way, the failure rate in the top decile is 19% (about 1 in 5), while the failure rate in the bottom decile is 0%. All of this means that P_FAIL does a great job of predicting FAILURE_TARGET.
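A quick way to express that result is lift: the share of failures a decile captures divided by the 10% a random assignment would capture. Using the figure quoted above:

```python
# If deciles were assigned randomly, each would capture 10% of failures.
pct_failures_top_decile = 69.0  # from the gains table interpretation above
baseline_pct = 10.0

lift = pct_failures_top_decile / baseline_pct
print(lift)  # -> 6.9
```

So the top decile identifies failures at roughly seven times the rate of a random selection.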
Next, we use the concept of a gains table to find the best cut-off.
First, let's define a few things. We just showed how a gains table is used to evaluate the effectiveness of a model. A confusion matrix is another way to gauge the effectiveness of a model.
A confusion matrix uses a cut-off value and then assigns each prediction into a binary yes/no format consistent with your business problem. In this case, we would need to classify each observation as either a predicted failure (1) or a predicted non-failure (0). Once we make this classification, a confusion matrix is simply the cross-tabulation of the column representing the predicted failure/non-failure and the column representing the actual failure/non-failure.
A confusion matrix communicates four different possible outcomes. Again, a cut-off (threshold) is required to build a confusion matrix.
In finding the best cut-off, our goal is to minimize the cost of misclassified predictions and maximize benefit of correctly classified predictions.
In other words, we want to select a cut-off that minimizes the impact of False Positives and False Negatives, while maximizing the impact of True Positives and True Negatives.
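Since sklearn.metrics is already imported above, a confusion matrix for any cut-off takes only a couple of lines. The labels and probabilities below are toy values, not from the data set:

```python
import numpy as np
from sklearn.metrics import confusion_matrix

# Toy example: hypothetical actual outcomes and predicted probabilities
actual = np.array([0, 0, 1, 1, 0, 1])
probs = np.array([0.2, 0.6, 0.7, 0.4, 0.1, 0.9])

cutoff = 0.5
predicted = (probs > cutoff).astype(int)

# sklearn's convention: rows are actuals, columns are predictions
# [[TN, FP],
#  [FN, TP]]
cm = confusion_matrix(actual, predicted)
print(cm)
```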
Here are a few other pertinent definitions:
In our data set:
Again, our objective in determining a cut-off is to minimize the combined cost of false positives and false negatives.
dfx=pd_data
Add a small random number to P_FAIL to break ties when grouping.
dfx['wookie'] = (np.random.randint(0, 100, dfx.shape[0]))/100000000000000000
dfx['P_FAIL']=dfx['P_FAIL']+dfx['wookie']
Instead of deciles, we create 10000 groups, based on the probability to fail.
dfx['GROUPS'] = pd.qcut(dfx['P_FAIL'], 10000, labels=False)
# find the minimum P_FAIL for each group. This is a potential cut-off point.
tips_summedb = pd.DataFrame(dfx.groupby(['GROUPS'])['P_FAIL'].min())
#Find the number of Failures in each group
tips_summedz = pd.DataFrame(dfx.groupby(['GROUPS'])['FAILURE_TARGET'].sum())
#find the number of observations in each group
tips_summeda = pd.DataFrame(dfx.groupby(['GROUPS'])['FAILURE_TARGET'].count())
#append the summaries into one dataframe
tips = pd.concat([tips_summedb,tips_summedz, tips_summeda], axis=1)
tips.columns = ['CUT-OFF','FAILURES', 'OBS']
#find the number of non-failures
tips['NON_FAILURES']=tips.OBS-tips.FAILURES
#reset the index to make GROUPS a column
tips.reset_index(level=0, inplace=True)
#sort the dataframe by groups in descending order
tips=tips.sort_values(by=['GROUPS'], ascending=[False])
# Cumulative sum the failures, non-failures and observations
tips['INV_CUM_FAILURES'] = tips.FAILURES.cumsum()
tips['INV_CUM_NON_FAILURES'] = tips.NON_FAILURES.cumsum()
tips['TOTAL_OBS']=tips.OBS.sum()
#Sort the data by Groups ascending
tips=tips.sort_values(by=['GROUPS'], ascending=[True])
#calculate the total number of failures and non-failures
tips['CUM_FAILURES'] = tips.FAILURES.cumsum()
tips['CUM_NON_FAILURES'] = tips.NON_FAILURES.cumsum()
#find the total number of failures for the whole dataset.
tips['TOTAL_FAILURES']=tips.FAILURES.sum()
tips['TOTAL_NON_FAILURES']=tips.NON_FAILURES.sum()
#define the true positives for each cut-off
tips['TRUE_POSITIVES']=tips.INV_CUM_FAILURES
#define the false positives for each cut-off
tips['FALSE_POSITIVES']=tips.INV_CUM_NON_FAILURES
#define the true negatives for each cut-off
tips['TRUE_NEGATIVES']=tips.CUM_NON_FAILURES-tips.NON_FAILURES
#define the false negatives for each cut-off
tips['FALSE_NEGATIVES']=tips.CUM_FAILURES-tips.FAILURES
#double check the logic and arithmetic.
tips['OBS2']=tips.TRUE_POSITIVES+tips.FALSE_POSITIVES+tips.TRUE_NEGATIVES+tips.FALSE_NEGATIVES
# define the sensitivity for each cut-off
tips['SENSITIVITY']=tips['TRUE_POSITIVES']/(tips['TRUE_POSITIVES']+tips['FALSE_NEGATIVES'])
#define the specificity for each cut-off
tips['SPECIFICITY']=tips['TRUE_NEGATIVES']/(tips['FALSE_POSITIVES']+tips['TRUE_NEGATIVES'])
#define the false positive rate for each cut-off
tips['FALSE_POSITIVE_RATE']=1-tips['SPECIFICITY']
#define the false negative rate for each cut-off
tips['FALSE_NEGATIVE_RATE']=1-tips['SENSITIVITY']
tipsx=tips
So, in the table below, the cut-off is in the second column. For example, if you use a cut-off of .003171, all probabilities greater than .003171 are labeled as predicted failures, and all probabilities less than .003171 are predicted non-failures. Using .003171 as a cut-off means that you will have:
4667 True Positives
160505 False Positives
34 True Negatives
0 False Negatives.
gains=tipsx[['GROUPS','CUT-OFF','TRUE_POSITIVES','FALSE_POSITIVES','TRUE_NEGATIVES','FALSE_NEGATIVES','SENSITIVITY',
'SPECIFICITY','FALSE_POSITIVE_RATE','FALSE_NEGATIVE_RATE']]
gains
In the previous step we created 10,000 potential cut-offs. Now we can determine the cut-off that minimizes the misclassification rate. The first step in this process is to calculate the misclassification rate for each cut-off.
tips=tipsx
#sum the false positives and false negatives.
tips['FALSE_CLASSIFICATIONS'] = tips.FALSE_POSITIVES+tips.FALSE_NEGATIVES
#estimate the false classification rate
tips['FALSE_CLASSIFICATION_RATE']=tips.FALSE_CLASSIFICATIONS/(tips.TOTAL_OBS)
gains=tips[['GROUPS','CUT-OFF','TRUE_POSITIVES','FALSE_POSITIVES','TRUE_NEGATIVES','FALSE_NEGATIVES','SENSITIVITY',
'SPECIFICITY','FALSE_POSITIVE_RATE','FALSE_NEGATIVE_RATE','FALSE_CLASSIFICATIONS','FALSE_CLASSIFICATION_RATE']]
In this first example, we calculate the simple, unweighted misclassification rate for each cut-off. Then, we determine which cut-off has the smallest rate. Note, the assumption is that a false positive carries the same cost as a false negative. Often, this is not the case.
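The unweighted rate is just (false positives + false negatives) / total observations. As a quick arithmetic check using counts reported elsewhere in this notebook (the total observation count is derived from the confusion-matrix counts above, so treat it as an assumption):

```python
# Counts quoted elsewhere in this notebook for one particular cut-off
false_positives = 356
false_negatives = 3982
total_obs = 4667 + 160505 + 34 + 0  # TP + FP + TN + FN at the extreme cut-off above

# Unweighted misclassification rate: all errors count equally
rate = (false_positives + false_negatives) / total_obs
print(round(rate, 4))
```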
gains
That's a lot of data. Let's examine the data visually.
x1 = tips['CUT-OFF']
y1 = tips['FALSE_POSITIVE_RATE']
y2 = tips['FALSE_NEGATIVE_RATE']
y3 = tips['FALSE_CLASSIFICATION_RATE']
trace = go.Scatter(
x = x1,
y = y1,
name='False Positive Rate')
trace2 = go.Scatter(
x = x1,
y = y2,
name='False Negative Rate'
)
trace3 = go.Scatter(
x = x1,
y = y3,
name='False Classification Rate'
)
layout = go.Layout(
title='Mis-Classification Rates BY CUT OFF SCORE',
xaxis=dict(
title='CUT OFF SCORE',
titlefont=dict(
family='Courier New, monospace',
size=18,
color='#7f7f7f'
)
),
yaxis=dict(
title='False Positive and False Negative Rates',
titlefont=dict(
family='Courier New, monospace',
size=18,
color='#7f7f7f'
)
),
showlegend=True,
)
data=[trace,trace2,trace3]
fig = go.Figure(data=data, layout=layout)
#plot_url = py.plot(fig, filename='styling-names')
plotly.offline.iplot(fig, filename='shapes-lines')
The chart above suggests that the best cut-off score is around 0.90.
Note that increasing the cut-off leads to more False Negatives and decreasing the cut-off leads to more False Positives.
gains=gains.sort_values(by=['FALSE_CLASSIFICATION_RATE'], ascending=[True])
gains=gains.head(1)
gains
By querying the data, we can see the best cut-off is 0.909427.
We can now use this cut-off to build a confusion matrix.
dfx=pd_data
dfx['Y_FAIL'] = np.where(((dfx.P_FAIL <= .909427)), 0, 1)
print(pd.crosstab(dfx.Y_FAIL, dfx.FAILURE_TARGET, dropna=False))
pd.crosstab(dfx.Y_FAIL, dfx.FAILURE_TARGET).apply(lambda r: r/r.sum(), axis=1)
Based on this cut-off, we have 356 false positives and 3,982 false negatives.
Now, let's compare these results with a cut-off of .50.
dfx=pd_data
dfx['Y_FAIL'] = np.where(((dfx.P_FAIL <= .5)), 0, 1)
print(pd.crosstab(dfx.Y_FAIL, dfx.FAILURE_TARGET, dropna=False))
pd.crosstab(dfx.Y_FAIL, dfx.FAILURE_TARGET).apply(lambda r: r/r.sum(), axis=1)
Using a cut-off of .5 means that we would have significantly more false positives and substantially fewer false negatives.
Note that you can use the table we created above to examine the cut-off, or threshold, in the context of an ROC Curve.
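If you prefer not to build the threshold table by hand, scikit-learn's `roc_curve` returns the same FPR/TPR/threshold triples directly. A toy sketch with made-up labels and scores standing in for FAILURE_TARGET and P_FAIL:

```python
import numpy as np
from sklearn.metrics import roc_curve, auc

# Toy data standing in for FAILURE_TARGET and P_FAIL
y_true = np.array([0, 0, 0, 1, 1, 0, 1, 0])
y_score = np.array([0.1, 0.3, 0.35, 0.8, 0.25, 0.2, 0.9, 0.4])

# roc_curve returns one (FPR, TPR) point per candidate threshold
fpr, tpr, thresholds = roc_curve(y_true, y_score)
roc_auc = auc(fpr, tpr)
print(round(roc_auc, 3))  # -> 0.8
```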
x2 = tips['CUT-OFF']
x1 = tips['FALSE_POSITIVE_RATE']
y1 = tips['SENSITIVITY']
trace = go.Scatter(
x = x1,
y = y1,
name='ROC')
trace2 = go.Scatter(
x = x2,
y = y1,
name='Cut-Off v TPR'
)
layout = go.Layout(
title='ROC with Threshold Levels',
xaxis=dict(
title='False Positive Rate and Threshold Level',
titlefont=dict(
family='Courier New, monospace',
size=18,
color='#7f7f7f'
)
),
yaxis=dict(
title='True Positive Rate',
titlefont=dict(
family='Courier New, monospace',
size=18,
color='#7f7f7f'
)
),
showlegend=True,
)
data=[trace,trace2]
fig = go.Figure(data=data, layout=layout)
#plot_url = py.plot(fig, filename='styling-names')
plotly.offline.iplot(fig, filename='shapes-lines')
In the previous example, we assumed the costs of false positives and false negatives were equal. In the real world, this is rarely the case. Take, for example, airplane engines. If you have a false positive, you replace equipment that doesn't need to be replaced. This is not ideal, but compare it to the cost of a false negative. A false negative occurs when your model predicts that an airplane engine will not fail and it does. A false negative could literally mean that a plane falls from the sky and all passengers onboard plunge to their death.
In the next scenario, we'll assume that the exact costs are not known, but we have a rough idea that a false negative is about twice as costly as a false positive.
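Weighting the two error types this way is equivalent to picking the cut-off that minimizes FP × cost_fp + FN × cost_fn. A toy sketch with hypothetical counts (the real values come from the cut-off table built above):

```python
# Hypothetical (cut-off, false positives, false negatives) triples
candidates = [
    (0.50, 4000, 100),
    (0.88, 500, 1200),
    (0.95, 200, 3000),
]

cost_fp, cost_fn = 1, 2  # a false negative costs twice a false positive

# Pick the cut-off with the smallest weighted error cost
best = min(candidates, key=lambda c: c[1] * cost_fp + c[2] * cost_fn)
print(best[0])  # -> 0.88
```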
tips=tipsx
#define the cost of a false positive and false negative
cost_of_a_false_positive=1
cost_of_a_false_negative=2
#convert the costs into weights that sum to one
fp_weight=(cost_of_a_false_positive)/(cost_of_a_false_positive+cost_of_a_false_negative)
fn_weight=(cost_of_a_false_negative)/(cost_of_a_false_positive+cost_of_a_false_negative)
#Create a weighted false classification count based on the costs of a false positive and a false negative, capped at the total number of observations.
weighted = 2*((fp_weight)*tips.FALSE_POSITIVES+(fn_weight)*tips.FALSE_NEGATIVES)
tips['FALSE_CLASSIFICATIONS_W'] = np.minimum(weighted, tips.TOTAL_OBS)
tips['FALSE_CLASSIFICATION_RATE_W']=tips.FALSE_CLASSIFICATIONS_W/(tips.TOTAL_OBS)
gains=tips[['GROUPS','CUT-OFF','TRUE_POSITIVES','FALSE_POSITIVES','TRUE_NEGATIVES','FALSE_NEGATIVES','SENSITIVITY',
'SPECIFICITY','FALSE_POSITIVE_RATE','FALSE_NEGATIVE_RATE','FALSE_CLASSIFICATIONS_W','FALSE_CLASSIFICATION_RATE_W']]
gains
There is a lot of data here. Let's examine the data graphically.
x1 = tips['CUT-OFF']
y1 = tips['FALSE_POSITIVE_RATE']
y2 = tips['FALSE_NEGATIVE_RATE']
y3 = tips['FALSE_CLASSIFICATION_RATE_W']
trace = go.Scatter(
x = x1,
y = y1,
name='False Positive Rate')
trace2 = go.Scatter(
x = x1,
y = y2,
name='False Negative Rate'
)
trace3 = go.Scatter(
x = x1,
y = y3,
name='Weighted False Classification Rate'
)
layout = go.Layout(
title='Weighted Mis-Classification Rates BY CUT OFF SCORE',
xaxis=dict(
title='CUT OFF SCORE',
titlefont=dict(
family='Courier New, monospace',
size=18,
color='#7f7f7f'
)
),
yaxis=dict(
title='Weighted False Positive and False Negative Rates',
titlefont=dict(
family='Courier New, monospace',
size=18,
color='#7f7f7f'
)
),
showlegend=True,
)
data=[trace,trace2,trace3]
fig = go.Figure(data=data, layout=layout)
#plot_url = py.plot(fig, filename='styling-names')
plotly.offline.iplot(fig, filename='shapes-lines')
Based on the chart above, we can see that the best cut-off score is around .88.
gains=gains.sort_values(by=['FALSE_CLASSIFICATION_RATE_W'], ascending=[True])
gains=gains.head(1)
gains
By querying the data, we can see it is precisely .8802.
dfx=pd_data
dfx['Y_FAIL'] = np.where(((dfx.P_FAIL <= .8802)), 0, 1)
print(pd.crosstab(dfx.Y_FAIL, dfx.FAILURE_TARGET, dropna=False))
pd.crosstab(dfx.Y_FAIL, dfx.FAILURE_TARGET).apply(lambda r: r/r.sum(), axis=1)
We assumed that false negatives were more expensive than false positives. This new assumption means the best answer changes. Specifically, the number of false negatives decreased from 3982 to 3531.
In the last example, we will assume that we have detailed costs for our business problem.
Let's assume the following:
If a false positive occurs, it costs the organization 2,500 in unnecessary repairs. If a false negative occurs, it costs the organization 2,500 in repairs and 25,000 in lost production.
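With explicit dollar figures, the criterion becomes minimizing total cost directly: TOTAL_COST = FP × 2,500 + FN × 27,500. A quick arithmetic check with hypothetical counts at two candidate cut-offs:

```python
cost_fp, cost_fn = 2500, 27500

# Hypothetical counts at two candidate cut-offs
fp_a, fn_a = 5000, 1500  # lower cut-off: more false positives
fp_b, fn_b = 900, 3500   # higher cut-off: more false negatives

cost_a = fp_a * cost_fp + fn_a * cost_fn
cost_b = fp_b * cost_fp + fn_b * cost_fn
print(cost_a, cost_b)  # -> 53750000 98500000
```

With a 11-to-1 cost ratio, the expensive false negatives dominate, so the cheaper option here is the lower cut-off even though it produces far more false positives.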
#Define the False Positive and False Negative Weights.
cost_of_a_false_positive=2500
cost_of_a_false_negative=27500
fp_weight=(cost_of_a_false_positive)/(cost_of_a_false_positive+cost_of_a_false_negative)
fn_weight=(cost_of_a_false_negative)/(cost_of_a_false_positive+cost_of_a_false_negative)
#Define the weighted false classification count, capped at the total number of observations
weighted = 2*((fp_weight)*tips.FALSE_POSITIVES+(fn_weight)*tips.FALSE_NEGATIVES)
tips['FALSE_CLASSIFICATIONS_W'] = np.minimum(weighted, tips.TOTAL_OBS)
tips['FALSE_CLASSIFICATION_RATE_W']=tips.FALSE_CLASSIFICATIONS_W/(tips.TOTAL_OBS)
tips['TOTAL_COST']=tips.FALSE_POSITIVES*cost_of_a_false_positive+tips.FALSE_NEGATIVES*cost_of_a_false_negative
gains=tips[['GROUPS','CUT-OFF','TRUE_POSITIVES','FALSE_POSITIVES','TRUE_NEGATIVES','FALSE_NEGATIVES','SENSITIVITY',
'SPECIFICITY','FALSE_POSITIVE_RATE','FALSE_NEGATIVE_RATE','FALSE_CLASSIFICATIONS_W','FALSE_CLASSIFICATION_RATE_W','TOTAL_COST']]
gains
x1 = tips['CUT-OFF']
y1 = tips['FALSE_POSITIVE_RATE']
y2 = tips['FALSE_NEGATIVE_RATE']
y3 = tips['FALSE_CLASSIFICATION_RATE_W']
trace = go.Scatter(
x = x1,
y = y1,
name='False Positive Rate')
trace2 = go.Scatter(
x = x1,
y = y2,
name='False Negative Rate'
)
trace3 = go.Scatter(
x = x1,
y = y3,
name='Weighted False Classification Rate'
)
layout = go.Layout(
title='Weighted Mis-Classification Rates BY CUT OFF SCORE',
xaxis=dict(
title='CUT OFF SCORE',
titlefont=dict(
family='Courier New, monospace',
size=18,
color='#7f7f7f'
)
),
yaxis=dict(
title='False Positive and False Negative Rates',
titlefont=dict(
family='Courier New, monospace',
size=18,
color='#7f7f7f'
)
),
showlegend=True,
)
data=[trace,trace2,trace3]
fig = go.Figure(data=data, layout=layout)
#plot_url = py.plot(fig, filename='styling-names')
plotly.offline.iplot(fig, filename='shapes-lines')
Let's examine the relationship between Total Costs and the cut-off or threshold level.
x1 = tips['CUT-OFF']
y1 = tips['TOTAL_COST']
trace = go.Scatter(
x = x1,
y = y1,
name='Total Cost')
layout = go.Layout(
title='Cut-Off Levels and Total Cost',
xaxis=dict(
title='CUT OFF SCORE',
titlefont=dict(
family='Courier New, monospace',
size=18,
color='#7f7f7f'
)
),
yaxis=dict(
title='Total Costs',
titlefont=dict(
family='Courier New, monospace',
size=18,
color='#7f7f7f'
)
),
showlegend=True,
)
data=[trace]
fig = go.Figure(data=data, layout=layout)
#plot_url = py.plot(fig, filename='styling-names')
plotly.offline.iplot(fig, filename='shapes-lines')
gains=gains.sort_values(by=['TOTAL_COST'], ascending=[True])
gains=gains.head(1)
gains
Given the actual costs of a false positive and a false negative, the best cut-off is .665541.
dfx['Y_FAIL'] = np.where(((dfx.P_FAIL <= .665541)), 0, 1)
print(pd.crosstab(dfx.Y_FAIL, dfx.FAILURE_TARGET, dropna=False))
pd.crosstab(dfx.Y_FAIL, dfx.FAILURE_TARGET).apply(lambda r: r/r.sum(), axis=1)
Note that the number of False Positives increases substantially from our previous example.
As you can see from these three scenarios, the best cut-off depends greatly on the economics of your problem. Because of this, it is important to understand the economic costs of both a false positive and a false negative. Just like most things in data science, the "best" answer depends on the context. If a false negative means a bolt will fly off and potentially injure someone, you have to make sure your model predictions reflect this hazard. Data science, like all things in this world, is subject to the context in which it is applied.
Shad Griffin, is a Data Scientist at the IBM Global Solution Center in Dallas, Texas